Four-Dimensional Cellular Automata Acceleration Using GPGPU
Introduction
GPGPU is General Purpose computation on
Graphics Processing Units
—
using a NVIDIA or ATI graphics card to perform non-graphical calculations
for your application. This technique is described in detail
in the book
GPU Gems 2 and on the
www.gpgpu.org web site. That web site
links to the helloGPGPU_GLSL sample code,
which I used as a starting point.
Modern graphics processors have multiple floating-point
pipelines and support programmable operations via shader languages such
as OpenGL
Shading Language. As a sample application I have re-implemented
a four-dimensional Cellular Automata in OpenGL Shading Language. For
an explanation of cellular automata, see
here.
Four-Dimensional Cellular Automata
For this demonstration I am using a continuous-valued cellular automata.
Each cell has 3 floating-point values, (red, green, blue). On each time step
the cell takes on a new value based on the values of its 8 neighbors in
four dimensions (+/- row, col, lvl, wvl). The
OpenGL three-dimensional viewing program allows the
user to choose which W-plane to view, so a three-dimensional slice of
four-dimensional space is displayed. The displayed cubes are translucent,
with their opacity determined by the sum of the (red,green,blue) values. A
white cube with values (1.0, 1.0, 1.0) is opaque.
OpenGL Shading language implementation
OpenGL shading language is similar to C. It has been extended with
primitive datatypes such as two to four-dimensional vectors of rgba or
xyzw. GPGPU computations are done by providing an array of
input data in the form
of a texture. I encoded my four-dimensional 16x16x16x16 space in a
256x256 texture by encoding the R and C dimensions in X as (16*R + C) and
L and W in Y as (16*L + W). Floating point RGB textures are already
supported so no encoding of the color data was necessary. The 4th channel
of the color was filled with 0 or 1 to indicate whether the cell should
always be blank. The 0th and 15th cell in each dimension was kept blank
to prevent the automata from wrapping around the edges.
To perform the computation the input texture is applied to a quadrilateral
and rendered in the frame buffer. As the texture is applied the custom
shader — our cellular automata shader — is applied to each pixel. The
shader code can be seen here, and the
entire program is available here. The framebuffer
is then copied back over the source texture. If desired the frame buffer
can also be copied back out to the application program. This data is
converted back into four-dimensional form for display in three dimensions.
Alternately the two-dimensional contents of the frame buffer can be viewed
directly. The following images are the two-dimensional texture
representations of the above images:
Performance
The ultimate goal is of course to achieve better algorithm performance
than can be achieved on a standard CPU. I have compared the performance
of the 16x16x16x16xRGB four-dimensional cellular automata program
implemented both
on the CPU and the GPU. I have compared it both returning the data to
the CPU for display in three dimensions, and retaining it in the GPU
for direct two-dimensional display as a texture (the CPU 2-D case is
not actually displayed). All performance measurements are on
a 1.73 GHz AMD Athlon CPU and an ATI Radeon 9800 Pro GPU. Debian Linux,
2.4.26 Kernel, XFree86 4.3, ATI 8.12.10-1 drivers from
here.
| steps / second |
| 1 CA step / frame |
10 CA steps / frame |
100 CA steps / frame |
1000 CA steps / frame |
| CPU 3D |
193 |
203 |
204 |
204 |
| GPU 3D |
16 |
144 |
485 |
637 |
| CPU 2D |
195 |
203 |
204 |
204 |
| GPU 2D |
599 |
645 |
653 |
657 |
|
|
data revised to remove most display overhead
The major limitation on the 3-D GPU performance is the
glReadPixels() call
to move the texture data from the GPU back to the CPU. In the 10 and 100 steps/frame
cases the cellular automata is updated (stepped forward) 10 or 100 times before
the data is copied back to the CPU. In the 2-D case the data is never copied
back from the GPU to the CPU, since it is displayed directly.
Conclusion
General-purpose algorithms run on the GPU can outperform the CPU if movement
of data onto and off of the GPU can be minimized. Ideally the algorithm
should be able to iterate many times before exchanging data with the CPU.
At present facilities for GPGPU programming are primitive. Higher-level
C-like languages such as OpenGL Shading Language are available, but
debugging is difficult. It is also difficult to determine what code
modifications will enhance or destroy pipeline performance — much
trial-and-error is required.
Special Thanks to Paul Gyugyi